<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Flexible and Economical UTF-8 Decoder</title>

<style type="text/css">
  td span { color: green }
  td strong { color: #ccc; }
  td, th { text-align: center }
  th { padding-left: 1em }
  strong { font-style: normal }
  u { font-weight:900; text-decoration: none }
  p.img img { display:block; }
  pre, table { margin-left: 3em; }
  p.img img { margin-left: -40px; }
  code.comment { color: green }
  code.key { color: blue }
  table td { padding-left: 4px; padding-right: 4px }
  table { margin-top: 2ex }
  body { padding-left: 5em; margin-left: 1em }
  .perf th { text-align: left }
  .perf thead th { background-color: white; font-weight: bold } 
  .perf th { font-weight: normal; background-color: #ddd }
  .perf td { text-align: right }
  .mine th { background-color: #e7e7e7 }
</style>
</head>
<body>
<h1>Flexible and Economical UTF-8 Decoder</h1>
<p>Systems with elaborate Unicode support usually confront programmers 
with a multitude of different functions and macros to process UTF-8 
encoded strings, often with different ideas on handling buffer 
boundaries, state between calls, error conditions, and performance 
characteristics, making them difficult to use correctly and efficiently.
 Implementations also tend to be very long and complicated; one popular 
library has over 500 lines of code just for one version of the decoder. 
This page presents one that is very easy to use correctly, short, small,
 fast, and free.</p>
<h2 id="implementation">Implementation in C (C99)</h2>
<pre><code class="comment">// Copyright (c) 2008-2009 Bjoern Hoehrmann &lt;bjoern@hoehrmann.de&gt;</code>
<code class="comment">// See https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.</code>

<code class="key">#define</code> UTF8_ACCEPT 0
<code class="key">#define</code> UTF8_REJECT 1

<code class="key">static const</code> uint8_t utf8d[] = {
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, <code class="comment">// 00..1f</code>
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, <code class="comment">// 20..3f</code>
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, <code class="comment">// 40..5f</code>
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, <code class="comment">// 60..7f</code>
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, <code class="comment">// 80..9f</code>
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, <code class="comment">// a0..bf</code>
  8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, <code class="comment">// c0..df</code>
  0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, <code class="comment">// e0..ef</code>
  0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, <code class="comment">// f0..ff</code>
  0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, <code class="comment">// s0..s0</code>
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, <code class="comment">// s1..s2</code>
  1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, <code class="comment">// s3..s4</code>
  1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, <code class="comment">// s5..s6</code>
  1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, <code class="comment">// s7..s8</code>
};

uint32_t <code class="key">inline</code>
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte &amp; 0x3fu) | (*codep &lt;&lt; 6) :
    (0xff &gt;&gt; type) &amp; (byte);

  *state = utf8d[256 + *state*16 + type];
  <code class="key">return</code> *state;
}
</pre>
<h2 id="usage">Usage</h2>
<p>UTF-8 is a variable length character encoding. To decode a character one or more bytes have to be read from a string. The <code>decode</code>
 function implements a single step in this process. It takes two 
parameters maintaining state and a byte, and returns the state achieved 
after processing the byte. Specifically, it returns the value <code>UTF8_ACCEPT</code> (0) if enough bytes have been read for a character, <code>UTF8_REJECT</code> (1) if the byte is not allowed to occur at its position, and some other positive value if more bytes have to be read.</p>
<p>When decoding the first byte of a string, the caller must set the state variable to <code>UTF8_ACCEPT</code>. If, after decoding one or more bytes the state <code>UTF8_ACCEPT</code> is reached again, then the decoded Unicode character value is available through the <code>codep</code> parameter. If the state <code>UTF8_REJECT</code>
 is entered, that state will never be exited unless the caller 
intervenes. See the examples below for more information on usage and 
error handling, and the section on implementation details for how the 
decoder is constructed.</p>
<h2 id="examples">Examples</h2>
<h3>Validating and counting characters</h3>
<p>This function checks if a null-terminated string is a well-formed 
UTF-8 sequence and counts how many code points are in the string.</p>
<pre><code class="key">int</code>
countCodePoints(uint8_t* s, size_t* count) {
  uint32_t codepoint;
  uint32_t state = 0;

  <code class="key">for</code> (*count = 0; *s; ++s)
    <code class="key">if</code> (!decode(&amp;state, &amp;codepoint, *s))
      *count += 1;

  <code class="key">return</code> state != UTF8_ACCEPT;
}
</pre>
<p>It could be used like so:</p>
<pre><code class="key">if</code> (countCodePoints(s, &amp;count)) {
  printf("The string is malformed\n");
} <code class="key">else</code> {
  printf("The string is %u characters long\n", count);
}
</pre>
<h3>Printing code point values</h3>
<p>This function prints out all code points in the string and an error 
message if unexpected bytes are encountered, or if the string ends with 
an incomplete sequence.</p>
<pre><code class="key">void</code>
printCodePoints(uint8_t* s) {
  uint32_t codepoint;
  uint32_t state = 0;

  <code class="key">for</code> (; *s; ++s)
    <code class="key">if</code> (!decode(&amp;state, &amp;codepoint, *s))
      printf("U+%04X\n", codepoint);

  <code class="key">if</code> (state != UTF8_ACCEPT)
    printf("The string is not well-formed\n");

}
</pre>
<h3>Printing UTF-16 code units</h3>
<p>This loop prints out UTF-16 code units for the characters in a null-terminated UTF-8 encoded string.</p>
<pre><code class="key">for</code> (; *s; ++s) {

  <code class="key">if</code> (decode(&amp;state, &amp;codepoint, *s))
    <code class="key">continue</code>;

  <code class="key">if</code> (codepoint &lt;= 0xFFFF) {
    printf("0x%04X\n", codepoint);
    <code class="key">continue</code>;
  }

  <code class="comment">// Encode code points above U+FFFF as surrogate pair.</code>
  printf("0x%04X\n", (0xD7C0 + (codepoint &gt;&gt; 10)));
  printf("0x%04X\n", (0xDC00 + (codepoint &amp; 0x3FF)));
}
</pre>
<h3>Error recovery</h3>
<p>It is sometimes desireable to recover from errors when decoding 
strings that are supposed to be UTF-8 encoded. Programmers should be 
aware that this can negatively affect the security properties of their 
application. A common recovery method is to replace malformed sequences 
with a substitute character like <code>U+FFFD REPLACEMENT CHARACTER</code>.</p>
<p>Decoder implementations differ in which octets they replace and where they restart. Consider for instance the sequence <code>0xED 0xA0 0x80</code>.
 It encodes a surrogate code point which is prohibited in UTF-8. A 
recovering decoder may replace the whole sequence and restart with the 
next byte, or it may replace the first byte and restart with the second 
byte, replace it, restart with the third, and replace the third byte 
aswell.</p>
<p>The following code implements one such recovery strategy. When an 
unexpected byte is encountered, the sequence up to that point will be 
replaced and, if the error occured in the middle of a sequence, will 
retry the byte as if it occured at the beginning of a string. Note that 
the decode function detects errors as early as possible, so the sequence
 <code>0xED 0xA0 0x80</code> would result in three replacement characters.</p>
<pre><code class="key">for</code> (prev = 0, current = 0; *s; prev = current, ++s) {

  <code class="key">switch</code> (decode(&amp;current, &amp;codepoint, *s)) {
  <code class="key">case</code> UTF8_ACCEPT:
    <code class="comment">// A properly encoded character has been found.</code>
    printf("U+%04X\n", codepoint);
    <code class="key">break</code>;

  <code class="key">case</code> UTF8_REJECT:
    <code class="comment">// The byte is invalid, replace it and restart.</code>
    printf("U+FFFD (Bad UTF-8 sequence)\n");
    current = UTF8_ACCEPT;
    <code class="key">if</code> (prev != UTF8_ACCEPT)
      s--;
    <code class="key">break</code>;<!--

  <code class=key>default</code>:
    <code class=comment>// The byte is valid, more bytes have to be read.</code>
    <code class=key>break</code>;
  }-->
  ...
</pre>
<p>For some recovery strategies it may be useful to determine the number
 of bytes expected. The states in the automaton are numbered such that, 
assuming C's division operator, <code>state / 3 + 1</code> is that number. Of course, this will only work for states other than <code>UTF8_ACCEPT</code> and <code>UTF8_REJECT</code>. This number could then be used, for instance, to skip the continuation octets in the illegal sequence <code>0xED 0xA0 0x80</code> so it will be replaced by a single replacement character.</p>
<h3>Transcoding to UTF-16 buffer</h3>
<p>This is a rough outline of a UTF-16 transcoder. Actual applications 
would add code for error reporting, reporting of words written, required
 buffer size in the case of a small buffer, and possibly other things. 
Note that in order to avoid checking for free space in the inner loop, 
we determine how many bytes can be read without running out of space. 
This is one utf-8 byte per available utf-16 word, with one exception: if
 the last byte read was the third byte in a four byte sequence we would 
get two words for the next byte; so we read one byte less than we have 
words available. This additional word is also needed for 
null-termination, so it's never wrong to read one less.</p>
<pre><code class="key">int</code>
toUtf16(uint8_t* src, size_t srcBytes, uint16_t* dst, size_t dstWords, ...) {

  uint8_t* src_actual_end = src + srcBytes;
  uint8_t* s = src;
  uint16_t* d = dst;
  uint32_t codepoint;
  uint32_t state = 0;

  <code class="key">while</code> (s &lt; src_actual_end) {
<!--
    <code class="comment">// To avoid checking for free space in the inner loop, we determine</code>
    <code class="comment">// how many bytes can be read without running out of space. This is</code>
    <code class="comment">// one utf-8 byte per available utf-16 word, with one exception: if</code>
    <code class="comment">// the last byte read was the third byte in a four byte sequence we</code>
    <code class="comment">// would get two words for the next byte; so we read one less. This</code>
    <code class="comment">// additional word is also needed for null-termination, so it's not</code>
    <code class="comment">// wrong to read one less in any case.</code>

-->
    size_t dst_words_free = dstWords - (d - dst);
    uint8_t* src_current_end = s + dst_words_free - 1;

    <code class="key">if</code> (src_actual_end &lt; src_current_end)
      src_current_end = src_actual_end;

    <code class="key">if</code> (src_current_end &lt;= s)
      <code class="key">goto</code> toosmall;

    <code class="key">while</code> (s &lt; src_current_end) {

      <code class="key">if</code> (decode(&amp;state, &amp;codepoint, *s++))
        <code class="key">continue</code>;

      <code class="key">if</code> (codepoint &gt; 0xffff) {
        *d++ = (uint16_t)(0xD7C0 + (codepoint &gt;&gt; 10));
        *d++ = (uint16_t)(0xDC00 + (codepoint &amp; 0x3FF));
      } <code class="key">else</code> {
        *d++ = (uint16_t)codepoint;
      }
    }
  }

  <code class="key">if</code> (state != UTF8_ACCEPT) {
    ...
  }

  <code class="key">if</code> ((dstWords - (d - dst)) == 0)
    <code class="key">goto</code> toosmall;

  *d++ = 0;
  ...

toosmall:
  ...
}

</pre>
<h2 id="design">Implementation details</h2>
<p>The <code>utf8d</code> table consists of two parts. The first part 
maps bytes to character classes, the second part encodes a deterministic
 finite automaton using these character classes as transitions. This 
section details the composition of the table.</p>
<h3>Canonical UTF-8 automaton</h3>
<p>UTF-8 is a variable length character encoding. That means state has 
to be maintained while processing a string. The following transition 
graph illustrates the process. We start in state zero, and whenever we 
come back to it, we've seen a whole Unicode character. Transitions not 
in the graph are disallowed; they all lead to state one, which has been 
omitted for readability.</p>
<p class="img"><img src="dfa-bytes.png" alt="DFA with range transitions"></p>
<h3>Automaton with character class transitions</h3>
<p>The byte ranges in the transition graph above are not easily encoded 
in the automaton in a manner that would allow fast lookup. Instead of 
encoding the ranges directly, the ranges are split such that each byte 
belongs to exactly one character class. Then the transitions go over 
these character classes.</p>
<p class="img"><img src="dfa-classes.png" alt="DFA with class transitions"></p>
<h3>Mapping bytes to character classes</h3>
<p>Primarily to save space in the transition table, bytes are mapped to character classes. This is the mapping:</p>
<table>
<tbody><tr>
<th>00..7f</th>
<td>0</td>
<th>80..8f</th>
<td>1</td>
</tr>
<tr>
<th>90..9f</th>
<td>9</td>
<th>a0..bf</th>
<td>7</td>
</tr>
<tr>
<th>c0..c1</th>
<td>8</td>
<th>c2..df</th>
<td>2</td>
</tr>
<tr>
<th>e0..e0</th>
<td>10</td>
<th>e1..ec</th>
<td>3</td>
</tr>
<tr>
<th>ed..ed</th>
<td>4</td>
<th>ee..ef</th>
<td>3</td>
</tr>
<tr>
<th>f0..f0</th>
<td>11</td>
<th>f1..f3</th>
<td>6</td>
</tr>
<tr>
<th>f4..f4</th>
<td>5</td>
<th>f5..ff</th>
<td>8</td>
</tr>
</tbody></table>
<p>For bytes that may occur at the beginning of a multibyte sequence, 
the character class number is also used to remove the most significant 
bits from the byte, which do not contribute to the actual code point 
value. Note that <code>0xc0</code>, <code>0xc1</code>, and <code>0xf5</code> .. <code>0xff</code>
 have all their bits removed. These bytes cannot occur in well-formed 
sequences, so it does not matter which bits, if any, are retained.</p>
<table>
<tbody><tr>
<th>c0</th>
<td>8</td>
<td><strong>11000000</strong></td>
<th>d0</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0000</span></td>
<th>e0</th>
<td>10</td>
<td><strong>11100000</strong></td>
<th>f0</th>
<td>11</td>
<td><strong>11110000</strong></td>
</tr>
<tr>
<th>c1</th>
<td>8</td>
<td><strong>1100000<u>1</u></strong></td>
<th>d1</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0001</span></td>
<th>e1</th>
<td>3</td>
<td><strong>111</strong><span>0000<u>1</u></span></td>
<th>f1</th>
<td>6</td>
<td><strong>111100</strong><span>0<u>1</u></span></td>
</tr>
<tr>
<th>c2</th>
<td>2</td>
<td><strong>11</strong><span>0000<u>1</u>0</span></td>
<th>d2</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0010</span></td>
<th>e2</th>
<td>3</td>
<td><strong>111</strong><span>000<u>1</u>0</span></td>
<th>f2</th>
<td>6</td>
<td><strong>111100</strong><span><u>1</u>0</span></td>
</tr>
<tr>
<th>c3</th>
<td>2</td>
<td><strong>11</strong><span>0000<u>1</u>1</span></td>
<th>d3</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0011</span></td>
<th>e3</th>
<td>3</td>
<td><strong>111</strong><span>000<u>1</u>1</span></td>
<th>f3</th>
<td>6</td>
<td><strong>111100</strong><span><u>1</u>1</span></td>
</tr>
<tr>
<th>c4</th>
<td>2</td>
<td><strong>11</strong><span>000<u>1</u>00</span></td>
<th>d4</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0100</span></td>
<th>e4</th>
<td>3</td>
<td><strong>111</strong><span>00<u>1</u>00</span></td>
<th>f4</th>
<td>5</td>
<td><strong>11110</strong><span><u>1</u>00</span></td>
</tr>
<tr>
<th>c5</th>
<td>2</td>
<td><strong>11</strong><span>000<u>1</u>01</span></td>
<th>d5</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0101</span></td>
<th>e5</th>
<td>3</td>
<td><strong>111</strong><span>00<u>1</u>01</span></td>
<th>f5</th>
<td>8</td>
<td><strong>11110<u>1</u>01</strong></td>
</tr>
<tr>
<th>c6</th>
<td>2</td>
<td><strong>11</strong><span>000<u>1</u>10</span></td>
<th>d6</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0110</span></td>
<th>e6</th>
<td>3</td>
<td><strong>111</strong><span>00<u>1</u>10</span></td>
<th>f6</th>
<td>8</td>
<td><strong>11110<u>1</u>10</strong></td>
</tr>
<tr>
<th>c7</th>
<td>2</td>
<td><strong>11</strong><span>000<u>1</u>11</span></td>
<th>d7</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>0111</span></td>
<th>e7</th>
<td>3</td>
<td><strong>111</strong><span>00<u>1</u>11</span></td>
<th>f7</th>
<td>8</td>
<td><strong>11110<u>1</u>11</strong></td>
</tr>
<tr>
<th>c8</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>000</span></td>
<th>d8</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1000</span></td>
<th>e8</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>000</span></td>
<th>f8</th>
<td>8</td>
<td><strong>11111000</strong></td>
</tr>
<tr>
<th>c9</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>001</span></td>
<th>d9</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1001</span></td>
<th>e9</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>001</span></td>
<th>f9</th>
<td>8</td>
<td><strong>1111100<u>1</u></strong></td>
</tr>
<tr>
<th>ca</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>010</span></td>
<th>da</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1010</span></td>
<th>ea</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>010</span></td>
<th>fa</th>
<td>8</td>
<td><strong>111110<u>1</u>0</strong></td>
</tr>
<tr>
<th>cb</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>011</span></td>
<th>db</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1011</span></td>
<th>eb</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>011</span></td>
<th>fb</th>
<td>8</td>
<td><strong>111110<u>1</u>1</strong></td>
</tr>
<tr>
<th>cc</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>100</span></td>
<th>dc</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1100</span></td>
<th>ec</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>100</span></td>
<th>fc</th>
<td>8</td>
<td><strong>11111100</strong></td>
</tr>
<tr>
<th>cd</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>101</span></td>
<th>dd</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1101</span></td>
<th>ed</th>
<td>4</td>
<td><strong>1110</strong><span><u>1</u>101</span></td>
<th>fd</th>
<td>8</td>
<td><strong>1111110<u>1</u></strong></td>
</tr>
<tr>
<th>ce</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>110</span></td>
<th>de</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1110</span></td>
<th>ee</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>110</span></td>
<th>fe</th>
<td>8</td>
<td><strong>11111110</strong></td>
</tr>
<tr>
<th>cf</th>
<td>2</td>
<td><strong>11</strong><span>00<u>1</u>111</span></td>
<th>df</th>
<td>2</td>
<td><strong>11</strong><span>0<u>1</u>1111</span></td>
<th>ef</th>
<td>3</td>
<td><strong>111</strong><span>0<u>1</u>111</span></td>
<th>ff</th>
<td>8</td>
<td><strong>11111111</strong></td>
</tr>
</tbody></table>
<h2 id="variations">Notes on Variations</h2>
<p>There are several ways to change the implementation of this decoder. 
For example, the size of the data table can be reduced, at the cost of a
 couple more instructions, so it omits the mapping of bytes in the 
US-ASCII range, and since all entries in the table are 4 bit values, two
 values could be stored in a single byte.</p>
<p>In some situations it may be beneficial to have a separate start 
state. This is easily achieved by copying the s0 state in the array to 
the end, and using the new state 9 as start state as needed.</p>
<p>Where callers require the code point values, compilers tend to 
generate slightly better code if the state calculation is moved into the
 branches, for example</p>
<pre><code class="key">if</code> (*state != UTF8_ACCEPT) {
  *state = utf8d[256 + *state*16 + type];
  *codep = (*codep &lt;&lt; 6) | (byte &amp; 63);
} <code class="key">else</code> {
  *state = utf8d[256 + *state*16 + type];
  *codep = (byte) &amp; (255 &gt;&gt; type);
}
</pre>
<p>As the state will be zero in the else branch, this saves a shift and 
an addition for each starter. Unfortunately, compilers will then 
typically generate worse code if the codepoint value is not needed. 
Naturally, then, two functions could be used, one that only calculates 
the states for validation, counting, and similar applications, and one 
for full decoding. For the sample UTF-16 transcoder a more substantial 
increase in performance can be achieved by manually including the decode
 code in the inner loop; then it is also worthwhile to make code points 
in the US-ASCII range a special case:</p>
<pre><code class="key">while</code> (s &lt; src_current_end) {

  uint32_t byte = *s++;
  uint32_t type = utf8d[byte];

  <code class="key">if</code> (state != UTF8_ACCEPT) {
    codep = (codep &lt;&lt; 6) | (byte &amp; 63);
    state = utf8d[256 + state*16 + type];

    <code class="key">if</code> (state)
      <code class="key">continue</code>;

  } <code class="key">else if</code> (byte &gt; 0x7f) {
    codep = (byte) &amp; (255 &gt;&gt; type);
    state = utf8d[256 + type];
    <code class="key">continue</code>;

  } <code class="key">else</code> {
    *d++ = (uint16_t)byte;
    <code class="key">continue</code>;
  }
  ...
</pre>
<p>Another variation worth of note is changing the comparison when setting the code point value to this:</p>
<pre>*codep = (*state &gt;  UTF8_REJECT) ?
  (byte &amp; 0x3fu) | (*codep &lt;&lt; 6) :
  (0xff &gt;&gt; type) &amp; (byte);
</pre>
<p>This ensures that the code point value does not exceed the value <code>0xff</code> after some malformed sequence is encountered.</p>
<p>As written, the decoder disallows encoding of surrogate code points, 
overlong 2, 3, and 4 byte sequences, and 4 byte sequences outside the 
Unicode range. Allowing them can have serious security implications, but
 can easily be achieved by changing the character class assignments in 
the table.</p>
<p>The code samples have generally been written to perform well on my 
system when compiled with Visual C++ 7.1 and GCC 3.4.5. Slight changes 
may improve performance, for example, Visual C++ 7.1 will produce 
slightly faster code when, in the manually inlined version of the 
transcoder discussed above, the type assignment is moved into the 
branches where it is needed, and the state and codepoint assignments in 
the non-ASCII starter is swapped (approximately a 5% increase), but GCC 
3.4.5 will produce considerably slower code (approximately 10%).</p>
<p>I have experimented with various rearrangements of states and character classes. A seemingly promising one is the following:</p>
<p class="img"><img src="dfa-rearr.png" alt="Re-arranged DFA with class transitions"></p>
<p>One of the continuation ranges has been split into two, the other 
changes are just renamings. This arrangement allows, when a continuation
 octet is expected, to compute the character class with a shift instead 
of a table lookup, and when looking at a non-ASCII starter, the next 
state is simply the character class. On my system the change in 
performance is in the area of +/- 1%. This encoding would have a number 
of downsides: more rejecting states are required to account for 
continuation octets where starters are expected, the table formatting 
would use more hex notation making it longer, and calculating the number
 of expected continuation octets from a given state is more difficult. 
One thing I'd still like to try out is if, perhaps by adding a couple of
 additional states, for continuation states the next state can be 
computed without any table lookup with a few easily paired instructions.</p>

<p>On 24th June 2010 Rich Felker pointed out that the state values in 
the transition table can be pre-multiplied with 16 which would save a 
shift instruction for every byte. D'oh! 
We actually just need 12 and can throw away the filler values previously
 in the table making the table 36 bytes shorter and save the shift in 
the code.
</p>
<pre><code class="comment">// Copyright (c) 2008-2010 Bjoern Hoehrmann &lt;bjoern@hoehrmann.de&gt;</code>
<code class="comment">// See https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.</code>

<code class="key">#define</code> UTF8_ACCEPT 0
<code class="key">#define</code> UTF8_REJECT 12

<code class="key">static const</code> uint8_t utf8d[] = {
  <code class="comment">// The first part of the table maps bytes to character classes that</code>
  <code class="comment">// to reduce the size of the transition table and create bitmasks.</code>
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
   8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,

  <code class="comment">// The second part is a transition table that maps a combination</code>
  <code class="comment">// of a state of the automaton and a character class to a state.</code>
   0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
  12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
  12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
  12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
  12,36,12,12,12,12,12,12,12,12,12,12, 
};

uint32_t <code class="key">inline</code>
decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte &amp; 0x3fu) | (*codep &lt;&lt; 6) :
    (0xff &gt;&gt; type) &amp; (byte);

  *state = utf8d[256 + *state + type];
  <code class="key">return</code> *state;
}
</pre>
<h2 id="performance">Notes on performance</h2>
<p>To conduct some ad-hoc performance testing I've used three different 
UTF-8 encoded buffers and passed them through a couple of UTF-8 to 
UTF-16 transcoders. The large buffer is a April 2009 Hindi Wikipedia 
article XML dump, the medium buffer Markus Kuhn's UTF-8-demo.txt, and 
the tiny buffer my name, each about the number of times required for 
about 1GB of data. All tests ran on a <a href="https://en.wikipedia.org/wiki/Celeron#Prescott-256">Intel Prescott Celeron</a> at 2666 MHz. See <a href="#changes">Changes</a> for some additional details.</p>
<table class="perf">
<thead>
<tr>
<th class="empty"></th>
<th>Large</th>
<th>Medium</th>
<th>Tiny</th>
</tr>
</thead>
<tbody><tr>
<th><code>NS_CStringToUTF16()</code> Mozilla 1.9 (<em>includes malloc/free time</em>)</th>
<td>36924ms</td>
<td>39773ms</td>
<td>107958ms</td>
</tr>
<tr>
<th><code>iconv()</code> 1.9 compiled with Visual C++ (Cygwin iconv 1.11 similar)</th>
<td>22740ms</td>
<td>21765ms</td>
<td>32595ms</td>
</tr>
<tr>
<th><code>g_utf8_to_utf16()</code> Cygwin Glib 2.0 (<em>includes malloc/free time</em>)</th>
<td>21599ms</td>
<td>20345ms</td>
<td>98782ms</td>
</tr>
<tr>
<th><code>ConvertUTF8toUTF16()</code> Unicode Inc., Visual C++ 7.1 -Ox -Ot -G7</th>
<td>11183ms</td>
<td>11251ms</td>
<td>19453ms</td>
</tr>
<tr>
<th><code>MultiByteToWideChar()</code> Windows API (Server 2003 SP2)</th>
<td>9857ms</td>
<td>10779ms</td>
<td>12771ms</td>
</tr>
<tr>
<th><code>u_strFromUTF8</code> from ICU 4.0.1 (Visual Studio 2008, web site distribution)</th>
<td>8778ms</td>
<td>5223ms</td>
<td>5419ms</td>
</tr>
<tr>
<th title="Incorrectly allows surrogate code points; modified version that uses pre-allocated buffer, normal version allocates worst case memory and resizes."><code>PyUnicode_DecodeUTF8Stateful</code> (3.1a2), Visual C++ 7.1 -Ox -Ot -G7</th>
<td>4523ms</td>
<td>5686ms</td>
<td>3138ms</td>
</tr>
<tr class="mine first">
<th>Example section transcoder, Visual C++ 7.1 -Ox -Ot -G7</th>
<td>5397ms</td>
<td>5789ms</td>
<td>6250ms</td>
</tr>
<tr class="mine">
<th>Manually inlined transcoder (see above), Visual C++ 7.1 -Ox -Ot -G7</th>
<td>4277ms</td>
<td>4998ms</td>
<td>4640ms</td>
</tr>
<tr class="mine">
<th>Same, Cygwin GCC 3.4.5 -march=prescott -fomit-frame-pointer -O3</th>
<td>4492ms</td>
<td>5154ms</td>
<td>4432ms</td>
</tr>
<tr class="mine">
<th>Same, Cygwin GCC 4.3.2 -march=prescott -fomit-frame-pointer -O3</th>
<td>5439ms</td>
<td>6322ms</td>
<td>5567ms</td>
</tr>
<tr class="mine">
<th>Same, Visual C++ 6.0 -TP -O2</th>
<td>5398ms</td>
<td>6259ms</td>
<td>6446ms</td>
</tr>
<tr class="mine">
<th>Same, Visual C++ 7.1 -Ox -Ot -G7 (<em>includes malloc/free time</em>)</th>
<td>5498ms</td>
<td>5086ms</td>
<td>25852ms</td>
</tr>
</tbody></table>
<p>I have also timed functions that <code>xor</code> all code points in the large buffer. In Visual Studio 2008 ICU's <code>U8_NEXT</code> macro comes out at ~8000ms, the <code>U8_NEXT_UNSAFE</code> macro, which requires complete and well-formed input, at ~4000ms, and the <code>decode</code>
 function is at ~5900ms. Using the same manual inlining as for the 
transcode function, Cygwin GCC 3.4.5 -march=prescott -O3 
-fomit-frame-pointer brings it down to roughly the same times as the 
transcode function for all three buffers.</p>
<p>While these results do not model real-world applications well, it 
seems reasonable to suggest that the reduced complexity does not come at
 the price of reduced performance. Note that instructions that compute 
the code point values will generally be optimized away when not needed. 
For example, checking if a null-terminated string is properly UTF-8 
encoded ...</p>
<pre><code class="key">int</code>
IsUTF8(uint8_t* s) {
  uint32_t codepoint, state = 0;

  <code class="key">while</code> (*s)
    decode(&amp;state, &amp;codepoint, *s++);

  <code class="key">return</code> state == UTF8_ACCEPT;
}
</pre>
<p>... does not require the individual code point values, and so the loop becomes something like this:</p>
<pre>l1: movzx  eax,al
    shl    edx,4
    add    ecx,1
    movzx  eax,byte ptr [eax+404000h]
    movzx  edx,byte ptr [eax+edx+256+404000h]
    movzx  eax,byte ptr [ecx]
    test   al,al
    jne    l1
</pre>
<p>For comparison, this is a typical <code>strlen</code> loop:</p>
<pre>l1: mov    cl,byte ptr [eax] 
    add    eax,1 
    test   cl,cl 
    jne    l1
</pre>
<p>With the large buffer and the same number of times as above, <code>strlen</code> takes 1507ms while <code>IsUTF8</code> takes 2514ms.</p>
<h2 id="license">License</h2>
<div class="license">
<p>Copyright (c) 2008-2009 <a href="https://bjoern.hoehrmann.de/">Bjoern Hoehrmann</a> &lt;<a href="mailto:bjoern@hoehrmann.de">bjoern@hoehrmann.de</a>&gt;</p>
<p>Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the 
"Software"), to deal in the Software without restriction, including 
without limitation the rights to use, copy, modify, merge, publish, 
distribute, sublicense, and/or sell copies of the Software, and to 
permit persons to whom the Software is furnished to do so, subject to 
the following conditions:</p>
<p>The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.</p>
<p>THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.</p>
</div>
<h2 id="changes">Changes</h2>
<dl>
<dt>25 Jun 2010</dt>
<dd>Added an improved variation based on an observation from Rich Felker.</dd>
<dt>30 April 2009</dt>
<dd>Added some more items to the performance table: the manually inlined
 transcoder allocating worst case memory for each run and freeing it 
before the next run; and results for Mozilla's NS_CStringToUTF16 (a new 
nsAutoString is created for each run, and truncated before the next). 
This used the XULRunner SDK 1.9.0.7 binary distribution from the Mozilla
 website.</dd>
<dt>18 April 2009</dt>
<dd>Added notes to the Variations section on handling malformed sequences and failed optimization attempts.</dd>
<dt>14 April 2009</dt>
<dd>Added PyUnicode_DecodeUTF8Stateful times; the function has been 
modified slightly so it works outside Python and so it uses a 
pre-allocated buffer. Normally does not check output buffer boundaries 
but rather allocates a worst case buffer, then resizes it. Apparently 
the decoder <a href="https://bugs.python.org/issue3672">allows encodings of surrogate code points</a>.</dd>
</dl>
<h2 id="author">Author</h2>
<address>
<p><a href="https://bjoern.hoehrmann.de/">Björn Höhrmann</a>&nbsp;<a href="mailto:bjoern@hoehrmann.de">bjoern@hoehrmann.de</a> (<a href="http://sourceforge.net/developer/user_donations.php?user_id=188003">Donate via SourceForge</a>, <a href="https://www.paypal.com/cgi-bin/webscr?cmd=_xclick&amp;business=bjoern@hoehrmann.de&amp;item_name=Support+Bjoern+Hoehrmann">PayPal</a>)</p>
</address>
<p><a href="https://bjoern.hoehrmann.de/utf-8/decoder/dfa/">https://bjoern.hoehrmann.de/utf-8/decoder/dfa/</a></p>
</body></html>